A multitask transfer learning framework for the prediction of virus-human protein-protein interactions

Viral infections are causing significant morbidity and mortality worldwide. Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanism of viral infection and pathogenesis. This could further help in the prevention and treatment of virus-related diseases. However, the task of predicting protein-protein interactions between a new virus and human cells is extremely challenging due to scarce data on virus-human interactions and fast mutation rates of most viruses. We developed a multitask transfer learning approach that exploits the information of around 24 million protein sequences and the interaction patterns from the human interactome to counter the problem of small training datasets. Instead of using hand-crafted protein features, we utilize statistically rich protein representations learned by a deep language modeling approach from a massive source of protein sequences. Additionally, we employ an additional objective which aims to maximize the probability of observing human protein-protein interactions. This additional task objective acts as a regularizer and also allows to incorporate domain knowledge to inform the virus-human protein-protein interaction prediction model. Our approach achieved competitive results on 13 benchmark datasets and the case study for the SAR-CoV-2 virus receptor. Experimental results show that our proposed model works effectively for both virus-human and bacteria-human protein-protein interaction prediction tasks. We share our code for reproducibility and future research at https://git.l3s.uni-hannover.de/dong/multitask-transfer.


Introduction
Virus infections cause an enormous and ever increasing burden on healthcare systems worldwide. The ongoing COVID-19 pandemic caused by the zoonotic virus, SARS-CoV-2, has resulted in enormous socio-economic losses [1]. Viruses infect all life forms and require host cells to complete their replication cycle by utilizing the host cell machinery. Virus infection involves several types of protein-protein interactions (PPIs) between the virus and its host. These interactions include the initial attachment of virus coat or envelope proteins to host membrane receptors, hijacking of the host translation and intracellular transport machineries resulting in replication, assembly and subsequent release of virus particles [2,3,4]. Besides providing arXiv:2111.13346v1 [cs.LG] 26 Nov 2021 mechanistic insights into the biology of infection, knowledge of virus-host interactions can point to essential events needed for virus entry, replication, or spread, which can be potential targets for the prevention, or treatment of virus-induced diseases [5].
In vitro experiments based on yeast-two hybrid (Y2H), ligand-based capture MS, proximity labeling MS, and protein arrays have identified tens of thousands of virushuman protein interactions [6,7,8,9,10,11,12,13,14]. These interaction data are deposited in publicly available databases including InAct [15], VirusMetha [16], VirusMINT [17], and HPIDB [18], and others. However, experimental approaches to unravel PPIs are limited by several factors, including the cost and time required, the generation, cultivation and purification of appropriate virus strains, the availability of recombinantly expressed proteins, generation of knock in or overexpression cell lines, availability of antibodies and cellular model systems. Computational approaches can assist in vitro experimentation by providing a list of most probable interactions, which actual biological experimentation techniques can falsify or verify.
In this work, we cast the problem of predicting virus-human protein interactions as a binary classification problem and focus specifically on emerging viruses that has limited experimentally verified interaction data.

Key Challenges in learning to predict virus-Human PPI
Limited interaction data. One of the main challenges in tackling the current task as a learning problem is the limited training data. Towards predicting virushost PPI, some known interactions of other human viruses collected from wet-lab experiments are employed as training data. The number of known PPIs is usually too small and thus, not representative enough to ensure the generalizability of trained models. In effect, the trained models might overfit the training data and would give inaccurate predictions for any given new virus.
Difference to other pathogens. A natural strategy to overcome the limitation posed by scarce virus protein interaction data is to employ transfer learning from available intra-species PPI or PPI data for other types of pathogens. This may, in its simplest fashion, not be a viable strategy as virus proteins can differ substantially from human or bacterial proteins. Typically, they are highly structurally and functionally dynamic. Virus proteins often have multiple independent functions so that they cannot be easily detected by common sequencestructure comparison [19,20,21]. Besides, virus protein sequences of different species are highly diverse [22]. Consequently, models trained for intra-species human PPI [23,24,25,26,27] or for other pathogen-human PPI [28,29,30,31,32,33] cannot be directly used to predict virus-human protein interactions.
Limited information on structure and function of virus proteins. While for human proteins, researchers can retrieve information from many publicly available databases to extract features related to their function, semantic annotation, domains, structure, pathway association, and intercellular localization, such information is not readily available for most virus proteins. Protein crystal structures are available for some virus proteins. However, for many, predictive structures based on the amino acid sequence must be used. Thus, for the majority of virus proteins, currently, the only reliable source of virus protein information is its amino acid sequence. Learning effective representations of the virus proteins, therefore, is an important step towards building prediction models. Heuristics such as K-mer amino acid composition are bound to fail as it is known that virus proteins with completely different sequences might show similar interaction patterns.

Our Contributions
In this work, we develop a machine learning model which overcomes the above limitations in two main steps, which are described below.
Transfer Learning via Protein Sequence Representations. Though the training data on interactions as well as the input information on protein features are limited, a large number of unannotated protein sequences are available in public databases like UniProt. Inspired by advancements in Natural Language Processing, Alley et al. [34] trained a deep learning model on more than 24 million protein sequences to extract statistically meaningful representations. These representations have been shown to advance the state-of-the-art in protein structure and function prediction tasks. Rather than using hand-crafted protein sequence features, we use the pre-trained model by [34] (referred to as UniRep) to extract protein representations. The idea here is to exploit transfer learning from several million sequences to our scant training data.
Incorporating Domain Information. We further fine-tune UniRep's globally trained protein representations using a simple neural network whose parameters are learned using a multitask objective. In particular, besides the main task, our model is additionally regularized by another objective, namely predicting interactions among human proteins. The additional objective allows us to encode (human) protein similarities dictated by their interaction patterns. The rationale behind encoding such knowledge in the learnt representation is that the human proteins sharing similar biological properties and functions would also exhibit similar interacting patterns with viral proteins. Using a simpler model and an additional side task helps us overcome overfitting, which is usually associated with models trained with small amounts of training data.
We refer to our model as MultiTask Transfer (MTT) and is further illustrated in Section 3. To sum up, we make the following contributions.
• We propose a new model that employs a transfer learning-based approach to first obtain the statistically rich protein representations and then further refines them using a multitask objective. • We evaluated our approach on several benchmark datasets of different types for virus-human and bacteria-human protein interaction prediction. Our experimental results (c.f. Section 5) show that MTT outperforms several baselines even on datasets with rich feature information. • Experimental results on the SARS-CoV-2 virus receptor shows that our model can help researchers to reduce the search space for yet unknown virus receptors effectively.

Related work
Existing work mainly casts PPI prediction task as a supervised machine learning problem. Nevertheless, the information about non-interacting protein pairs is usually not available in public databases. Therefore, researchers can only either adapt models to learn from only positive samples or employ certain negative sampling strategy to generate negative examples for training data. Since the quality and quantity of the generated negative samples would significantly affect the outcome of the learned models, the authors in [35,31,36] proposed models that only learned from the available known positive interactions. Nourani et al. [36] and Li et al. [31] treated the virus-human PPI problem as a matrix completion problem in which the goal was to predict the missing entries in the interaction matrix. Nouretdinov et al. [35] use a conformal method to calculate p-values/confidence level related to the hypothesis that two proteins interact based on similarity measures between proteins. Another line of work which casts the problem as a binary classification task focussed on proposing new negative sampling techniques. For instance, Eid et al [22] proposed Denovo -a negative sampling technique based on virus sequence dissimilarity. Mei et al. [37] proposed a negative sampling technique based on one class SVM. Basit et al. [33] offered a modification to the Denovo technique by assigning sample weights to negative examples inversely proportional to their similarity to known positive examples during training.
Dick et al. [30] utilizes the interaction pattern from intra-species PPI networks to predict the inter-species PPI between human-HIV-1 virus and human. Though the results are promising, this cannot be directly applied to completely new viruses where information about closely-related species is not available or to viruses whose intra-species PPI information is not available.
The works presented in [38,39,40,41,42,43,44] employed different feature extraction strategies to represent a virus-human protein pair as a fixed-length vector of features extracted from their protein sequences. Instead of hard-coding sequence feature, Yang et al. [45] and Lanchantin et al. [46] proposed embedding models to learn the virus and human proteins' feature representations from their sequences. However, their training data was limited to around 500,000 protein sequences. Though not very common, other types of information/features were also used in some proposed models besides sequence-based features. Those include protein functional information (or GO annotation) as in [47], proteins domain-domain associations information as in [48], protein structure information as in [49,32], and the disease phenotype of clinical symptoms as in [47]. One limitation of these approaches is that they cannot be generalized to novel viruses where such kind of information is not available.
Among the network-based approaches, Liu et al. and Wang et al. [50,51] constructed heterogeneous networks to compute virus and human proteins features. Nodes of the same type were connected by either weighted edges based on their sequence similarity or a combination of sequence similarity and Gaussian Interaction Profile kernel similarity. Deng et al. [43] proposed a deep-learning-based model with a complex architecture of convolutional and LSTM layers to learn the hidden representation of virus and human proteins from their input sequence features along with the classification problem. Despite the promising performance, those studies still have the limitation posed by hand-crafted protein features.

Method
We first provide a formal problem statement.
Problem Statement. We are given protein sequences corresponding to infectious viruses and their known interactions with human proteins. Given a completely new (novel) virus, its set of protein(s) V along with its (their) sequence(s), we are interested in predicting potential interactions between V and the human proteins.
We cast the above problem as that of binary classification. The positive samples consist of pairs of virus and human proteins whose interaction has been verified experimentally. All other pairs are considered to be non-interacting and constitute the negative samples. In Section 4, we add details on positive and negative samples corresponding to each dataset.
Summary of the approach. The schematic diagram of our proposed model is presented in Figure 1. As shown in the diagram, the input to the model is the raw human and virus protein sequences which are passed through the UniRep model to extract low dimensional vector representations of the corresponding proteins. The extracted embeddings are then passed as initialization values for the embedding layers. These representations are further fine-tuned using the Multilayer Perceptron (MLP) modules (shown in blue). The fine-tuning is performed while learning to predict an interaction between two human proteins (between proteins A and B in the figure) as well as the interaction between human and virus proteins (between proteins B and C). In the following, we describe in detail the main components of our approach.

Extracting protein representations
Significance of using protein sequence as input. We note that the protein sequence determines the protein's structural conformation (fold), which further determines its function and its interaction pattern with other proteins. However, the underlying mechanism of the sequence-to-structure matching process is very complex and cannot be easily specified by hand-crafted rules. Therefore, rather than using hand-crafted features extracted from amino acid sequences, we employ the pre-trained UniRep model [34] to generate latent representations or protein embeddings. The protein representations extracted from UniRep model are empirically shown to preserve fundamental properties of the proteins and are hypothesized to be statistically more robust and generalizable than hand-crafted sequence features.
UniRep for extracting sequence representations. In particular, UniRep consists of an embedding layer that serves as a lookup table for each amino acid representation. Each amino acid is represented as an embedding vector of 10 dimensions. Each input protein sequence of length N will be denoted as a two-dimensional Figure 1: Our proposed MTT model for the virus-human PPI prediction problem. The UniRep embeddings are used to initialize our embedding layers which will be further fine-tuned by the two PPI prediction tasks. Sharing representation for human proteins further enables us to transfer the knowledge learned from the human PPI network to inform our virus-host PPI prediction task. matrix of size Nx10. That two-dimensional matrix will then feed as input to a Multiplicative Long Short Term Memory (mLSTM) network of 1900 units. The 1900 dimension is selected experimentally from a pool of architectures that require different numbers of parameters as described in [52], namely, a 1900-dimensional single layer multiplicative LSTM (∼18.2 million parameters), a 4-layer stacked mLSTM of 256 dimensions per layer (∼1.8 million parameters), and a 4-layer stacked mLSTM with 64 dimensions per layer (∼0.15 million parameters). The output from mLSTM is a 1900 dimensional embedding vector that serves as the pre-trained protein embedding for the input protein sequence. We use the calculated pre-trained virus and human protein embeddings to initialize our embedding layers. The two supervised PPI prediction tasks will further fine-tune those embeddings during training.

Learning framework
We further fine-tune these representations by training two simple neural networks (single layer MLP with ReLu activation) using an additional objective of predicting human PPI in addition to the main task. More precisely, the UniRep representations will be passed through one hidden layer MLPs with ReLU activations to extract the latent representations. Let X denote the embedding lookup matrix. The ith row corresponds to the embedding vector of node i. The final output from MLP layers for an input v is then given by hid(v) = MLP(X(v)). To predict the likelihood of interaction between a pair (v 1 , v 2 ) we first perform an element-wise product of the corresponding hidden vectors (output of MLPs) and pass it through a linear layer followed by sigmoid activation. In the following we provide a detailed description of our multi-task objective.

Training using a Multi-Task Objective
Let Θ, Φ denote the set of learnable parameters corresponding to fine-tuning components (as shown in Figure 1 in green and blue boxes), i.e., the Multilayer Perceptrons (MLP) corresponding to the virus and human proteins, respectively. Let W 1 , W 2 denote the two learnable weight matrices (parameters) for the linear layers (as depicted in gray boxes in the Figure). We use V H, and HH to denote the training set of virus-human, human-human PPI, correspondingly. We use binary cross entropy loss for predicting virus-human PPI predictions, as given below: where variables z vh is the corresponding binary target variable and y vh is the predicted likelihood of observing virus-human protein interaction, i.e., where σ(x) = 1/1 + e −x is the sigmoid activation and denotes the element-wise product.
For human PPI, we predict the confidence score of observing an interaction between two human proteins. More specifically, we directly predict z hh -the normalized confidence scores for interaction between two human proteins as collected from STRING [53] database. Predicting the normalized confidence scores helps us overcome the issues with defining negative interactions. We use mean square error loss to compute the loss for the human PPI prediction task as below where y hh is computed similar to (2) for human proteins and N is the number of (h, h ) pairs.
We use a linear combination of the two loss functions to train our model.
where α is the human PPI weight factor.

Data Description and Experimental set up
We commence by describing the 13 datasets used in this work to evaluate our approach.

Benchmark datasets 4.1.1 The realistic host cell-virus testing datasets
The Novel H1N1 and Novel Ebola datasets. We retrieve the curated or experimentally verified PPIs between virus and human from four databases: APID [54], In-tAct [15], VirusMetha [16], and UniProt [55] using the PSICQUIC web service [56]. Negative sampling techniques such as the dissimilarity-based method [22], the exclusive co-localization method [57,58] are usually biased as they restrict the number of tested human proteins. It is also unrealistic for a new virus because information about such restricted human protein set, generated from filtering criteria based on the positive instances, is typically unavailable. For those reasons, we argue that random negative sampling is the most appropriate, unbiased approach to generate negative training/testing samples. Since the exact ratio of positive:negative is unknown, we conducted experiments with different negative sample rates. In our new virus-human PPI experiments, we try four negative sample rates: [1,2,5,10]. In addition, to reduce the bias of negative samples, the negative sampling in the training and testing set is repeated ten times. In the end, for each dataset, we test each method with 4x4x10 = 160 different combinations of negative training and negative testing sets (with fixed positive training and test samples). The statistics for our new testing datasets are given in Table 1.
The data was retrieved from the HPIDB database [18] to include all Pathogen-Host interactions that have confidence scores available and are associated with an existing virus family in the NCBI taxonomy [59]. After filtering, the dataset includes 24,678 positive interactions and 1,066 virus proteins from 14 virus families. We follow the same procedure as mentioned in [47]

The specialized testing datasets
The dataset with protein motif information (Denovo SLiM [22]). The Denovo SLiM dataset Virus-human PPIs were collected from VirusMentha database [16]. The presence of Short Linear Motif (SLiM) in virus sequences was used as a criterion for data filtering. SLiMs are short, recurring patterns of protein sequences that are believed to mediate protein-protein interaction [60,61]. Therefore, sequence motifs can be a rich feature set for virus-human PPI prediction tasks. The test set [22] contained 425 positives and 425 negative PPIs (Supplementary file S12 used in DeNovo's study ST6). The training data consisted of the remaining PPI records and comprised of 1590 positive and 1515 negative records for which virus SLiM sequence is known and 3430 positives and 3219 negatives without virus SLiM sequences information. Denovo slim negative samples were also generated using the Denovo negative sampling strategy (based on sequence dissimilarity).
The Barman's dataset [48] with protein domain information. The dataset was retrieved from VirusMINT database [17]. Interacting protein pairs that did not have any "InterPro" domain hit were removed. In the end, the dataset contained 1,035 positives and 1,035 negative interactions between 160 virus proteins of 65 types and 667 human proteins. 5-Fold cross-validation was then employed to test each method's performance.

The bacteria human PPI prediction task.
We evaluate our method on three datasets for three human pathogenic bacteria: Bacillus anthracis (B1), Yersinia pestis (B2), and Francisella tularensis (B3), which were shared by Fatma et al. [22]. The data was first collected from HPIDB [18]. B1 belongs to a bacterial phylum different from that of B2 and B3, while B2 and B3 share the same class but differ in their taxonomic order. B1 has 3057 PPIs, B2 has 4020, and B3 has 1346 known PPIs. A sequence-dissimilarity-based negative sampling method was employed to generate negative samples. For each bacteria protein, ten negative samples were generated randomly. Each of the bacteria was then set aside for testing, while the interactions from the other two bacteria were used for training. For simplicity, we use the name of the bacteria in the testing set as the name of the dataset. The statistics for those three datasets are presented in table 2. Table 2: Our bacteria-human PPI benchmark datasets' statistics. |E + | and |E − | refer to the number of positive and negative interactions, respectively. |V h | and |V b | are the number of human proteins and bacteria proteins.

Training data
Testing data • MotifTransformer [46]: It is a transformer-based deep neural network that pre-trains protein sequence representations using unsupervised language modeling tasks and supervised protein structure and function prediction tasks. These representations are used as input to an order-independent classifier for the PPI prediction task. • DeNovo [22]: This model trained an SVM classifier on a hand-crafted feature set extracted from the K-mer amino acid composition information using a novel negative sampling strategy. Each protein pair is represented as a vector of 686 dimensions. • DeepViral [47]: It is a deep learning-based method that combines information from various sources, namely, the disease phenotypes, virus taxonomic tree, protein GO annotation, and proteins sequences for intra-and interspecies PPI prediction. • Barman [48]: It used an SVM model trained on a feature set consisting of the protein domain-domain association and methionine, serine, and valine amino acid composition of viral proteins. • 2 simpler variants of MTT: Towards ablation study, we evaluate two simpler variants: (i) SingleTask Transfer (STT), which is trained on a single objective of predicting pathogen-human PPI. STT is basically the MTT without the human PPI prediction side task and (ii) Naive Baseline, which is a Logistic regression model using concatenated human and pathogen protein UniRep representations as input.

Implementation details and parameter set up
We use Pytorch [63] to implement our model and run it on an Nvidia GTX 1080-Ti with 11GB memory. We use Adam optimizer for the model parameter optimization. For all datasets, we left out 10% of the training data for validation and performed a grid search for the best combination of parameters on that validation set. For datasets other than Novel H1N1 and Novel Ebola, we perform parameter grid searching with the MLP hidden dimension hid in [8,16,32,64], α in [10 −3 , 10 −2 , 10 −1 , 1], the number of epochs from 0 to 200 with a step of 2 and the learning rate lr in [10 −3 , 10 −2 ]. For the Novel H1N1 and Novel Ebola datasets, we test each with 160 different combinations of negative training and negative testing. Therefore, we fix the hidden dimension to 16, α = 10 −3 , lr = 10 −3 and only perform grid searching on the number of epochs. The reported results for each dataset are the results corresponding to the best-performed model on the validation set. For the Doc2vec model, we use the released code shared by the authors with the given parameters. For the Generalized and Denovo models, we re-implement the methods in Python using all the parameters and feature set as described in the original papers. For Barman and DeepViral, the results are taken from the original papers or calculated from the given model prediction scores.

Evaluation metrics
For all benchmark datasets except the case study, we report five metrics: the Area under Receiver Operating Characteristic curve (AUC) and the area under the precision-recall curve (AP), the Precision, Recall, and F1 scores.
For the case study, we report the topK score with K from 1 to 10. TopK is equal to 1 if the human receptor for SARS-CoV-2 virus appears in the top K proteins that have the highest scores predicted by the model and 0 otherwise.

Result Analysis
In the following four subsections, we provide a detailed comparison of MTT with (i) methods employing hand-crafted input features, (ii) sequence embedding-based methods, (iii) an approach that uses protein domain information, (iv) simpler variants of MTT as ablation studies respectively. All statistical test results present in this section are those from the pair-wise t-test [64] on the F1 scores attained from multiple runs on the same dataset. Figure 2: Comparison between MTT and state-of-the-art methods on small testing datasets. MTT is statistically better than Denovo in all benchmarked datasets. Compared with Generalized, MTT has higher performance in six out of seven datasets. MTT outperforms Doc2vec in four out of seven datasets.

Comparison with methods employing hand-crafted features
Generalized [41] and Denovo [22] are the two traditional methods relying on hand-crafted features extracted from the protein sequences. The number of handcrafted features employed by Denovo and Generalized are 686 and 1,175, respectively. They both employ SVM for the classification task. Since SVM scales quadratically with the number of data points, Denovo and Generalized are not scalable to larger datasets. Figure 2 presents their comparison between MTT on small testing datasets. Detailed scores are given in Table 6 in the Appendix. Results from the two-tailed t-test [65,66] support that MTT significantly outperforms Denovo in all benchmarked datasets with a confidence score of at least 95%. Compared with Generalized, MTT has higher performance in six out of seven datasets (except Denovo slim). The difference is the most significant on the Barman, Zhou's H1N1, and Zhou's Ebola datasets. On Denovo slim dataset, MTT's F1 score is lower than Generalized and only 2% higher than Denovo. This is expected since Denovo slim is a specialized dataset favoring methods using local sequence motif features, which are exploited by Denovo and Generalized.
Hybrid is one recently proposed, deep learning-based method. Despite that, the input features are still manually extracted from the protein sequence. Since the code is not publicly available, we only have the AUC score corresponding to the Zhou's H1N1 dataset, which is also taken from the original paper as listed in table 6. Compared with Hybrid, MTT has higher AUC score. Though comparison on the AUC for one dataset does not bring much insight, we include this method here for completeness.

Comparison with sequence embedding based methods
Doc2vec and MotifTransformer are state-of-the-art methods based on sequence embeddings or representations. Doc2vec utilizes the embeddings learned from the extracted k-mer features while MTT and MotifTransformer employ the embedding directly learned from the amino acid sequences. In addition, MTT is a multitask-based approach that incorporates additional information on human protein-protein interaction into the learning process. Figure 3 shows a comparison in F1 score of MTT and Doc2vec over all benchmarked datasets. Detailed scores are presented in Table 7 in the Appendix. Since the code for the MotifTransformer model is not publicly available, we only have the corresponding results available for the Zhou's H1N1 and Zhou's Ebola datasets, which are also taken from the original paper. '-' denotes the score is not available. Compared with MotifTransformer, MTT has a slightly worse F1 score on Zhou's H1N1 and significantly better F1 score on Zhou's Ebola datasets.
Comparison with Doc2vec. MTT out-performs Doc2vec in 5 out of 9 benchmark datasets, and the performance gap is statistically significant with a p-value smaller than 0.05. MTT is significantly better than Doc2vec on the Novel Ebola dataset, while on the Novel H1N1 dataset, the reverse holds true. Doc2vec outperforms MTT in three testing datasets whose negative samples were drawn from a sequence dissimilarity method. We also note that these datasets might be biased since in the ideal testing scenario, we do not have knowledge about the set of human proteins that interacted with the virus. Therefore, such dissimilarity-based negative sampling is infeasible.

Comparison with methods that use domain information
Barman features set is constructed from the domain-domain association and the hand-crafted feature extracted from the protein sequences. Since the protein domain information is not available for all viral proteins, the Barman method has restricted application. A comparison between Barman and MTT is presented in table 3. Due

Comparison with methods that used GO, taxonomy and phenotype information
DeepViral exploited that disease phenotypes, the viral taxonomies, and proteins' GO annotation to enrich its protein embeddings. Table 4 presents a comparison between MTT and DeepViral on the four datasets released by DeepViral's authors. The reported results on each dataset are the average after five experimental runs for DeepViral and ten experimental runs for MTT. We observe MTT and STT significantly supersede their competitor regarding the averaged F1 score. The gain is more significant on smaller datasets (644788 and 333761)

Ablation Studies
We compare our method with two of its simpler variants: the STT and the Naive baseline baseline models. STT is the MTT model without the human PPI prediction task. Naive baseline concatenates the learned embeddings for the virus and human proteins to form the input to a Logistic Regression model. Figure 4 presents a comparison between the F1 score of MTT and its variants on our benchmarked datasets. Table 8 show all reported scores over all datasets. MTT is significantly better than STT in five out of nine benchmarked and the four DeepViral datasets with a p-value smaller than 0.05. While in the remaining four datasets, the difference is not statistically significant. This confirms that the learned patterns from the human PPI network bring additional benefits to the virus-human PPI prediction task.
Compare with Naive baseline, MTT wins in eight out of nine benchmarked and the four DeepViral datasets. On the remaining dataset (Novel H1N1), the difference is not statistically different. STT significantly outperforms Naive baseline in eight out of nine datasets. This claims the effectiveness of our chosen architecture.

Case study for SARS-CoV-2 binding prediction
The virus binding to cells or the interaction between viral attachment proteins and host cell receptors is the first and decisive step in the virus replication cycle. Identifying the host receptor(s) for a particular virus is often fundamental in unveiling the virus pathogenesis and its species tropism.  Denovo slim, Yersina, and Franci), the difference is not statistically significant. MTT is statistically better than Naive baseline on eight out of nine datasets, while on the remaining dataset(Novel Ebola), the difference is not statistically different.
Here we present a case study for detecting the human protein binding partners for SARS-CoV-2. Our virus-human PPI dataset is retrieved from the InAct Molecular Interaction database [15] (the latest update is 07.05.2021). We retrieve the protein sequences from Uniprot [55]. In the next section, we describe the construction of the training and testing dataset to predict SARS-CoV-2 binding partners.

Training, Validation and Test Sets for Virus-Human PPI
The statistics for our SARS-CoV-2 binding prediction dataset are presented in table 5. We construct the corresponding datasets as follows.
Training Set. As positive interaction samples, we include in the training data only direct interactions between the human proteins and any virus except the SARS-CoV and SARS-CoV-2. Direct interaction requires two proteins to directly bind to each other, i.e. without an additional bridging protein. Moreover, the interacting human protein should be on the cell surface. Without loss of generality, we perform our search for the binding receptor on the set of all human proteins that have a KNOWN direct interaction with any virus and locate to the cell surface. Our surface human protein list consists of all reviewed Uniprot proteins that meet at least one of the following criteria: (i) appears in the human surfacetome [67] list or (ii) has at least one of the following GO annotations [68,69]:{CC-plasma membrane, CC-cell junction}.
The negative samples for training data contain indirect (interactions that are not marked as direct in the database) between the human proteins and any virus except SARS-CoV and SARS-CoV-2. The indirect interactions can be a physical association (two proteins are detected in the same protein complex at the same point of time) or an association in which two proteins that may participate in the formation of one or more physical complexes without additional evidence whether the proteins are directly binding to specific members of such a complex).
Validation and Test Sets. As established in studies [70,71,72], angiotensinconverting enzyme 2 (ACE2) is the human receptor for both SARS-CoV [73] and SARS-CoV-2 viruses [72]. The positive validation and testing set consist of interaction between the known human receptor (ACE2) and the corresponding spike proteins of SARS-CoV and SARS-CoV-2, respectively. Our negative validation and testing set encapsulate of all possible combinations the two viral spike proteins and 52 human proteins that meet our filtering criteria.

The intra human PPI for the Side Task
Since we are interested in only the direct interaction between virus and human proteins, we also customize our intra human PPI training set. Our intra human PPI dataset is also retrieved from the InAct [15] database (the latest update is 07.05.2021). We retain only interactions between two human proteins that appear in the virus-human PPI dataset constructed above. The confidence scores are normalized into the [0,1] ranges. All confidence scores corresponding to "indirect" interactions are set to 0. In the end, our intra-human PPI training set consists of 96,458 interactions between 5,563 human proteins.    Finally, we here evaluate the prediction methods on how effective they are in ranking human protein candidates for binding to an emerging virus envelope protein. Figure 5 presents the methods' performance after ten runs on the case study dataset. TopK is equal to 1 if the true human receptor appears in the top K proteins that correspond to the highest predicted scores by the model and is equal to 0 otherwise. The reported scores plotted in Figure 5 are the average after ten experimental runs with random initialization.
Using this method we find that ACE2, the only SARS-CoV-2 receptor proven in in vivo and in vitro studies [72,74,75], consistently appears as the highest ranked prediction of MTT in each of the ten experimental runs. We observe a significant difference between the highest ranked performance of MTT and its competitors. The performance gain shown by MTT over STT is quite substantial after ten runs and supports the superiority of our multitask framework. The next highest nine hits presented in both models have not been shown to interact with SARS-CoV-2 in in vitro studies. Interestingly, dipeptidyl peptidase 4 (DDP4), a receptor for another betacoronavirus MERS-CoV [76] also scored highly in the MTT method. However, although in silico analysis has speculated a possible interaction [77], it is yet to be shown experimentally. Similarly, the serine protease TMPRSS2, which is required for SARS-CoV-2 S protein priming during entry [72], appeared in position 7 using the Doc2vec model. Finally, aminopeptidase N (ANPEP) the receptor for the common cold coronavirus 229E appeared as first hit in the Doc2vec model [78].
In Figures 6 and 7, we plot the average confidence scores (corresponding to predicted interaction probability) corresponding to top 10 predictions of MTT and Doc2vec models. Specifically, the proteins are ranked based on the average (over 10 runs) confidence scores as predicted by the two models. While for MTT, the receptor ACE2 always occurs at the top of the list with average confidence score of more than 0.70 (which is more than 11% higher than the confidence score assigned to the second hit), Doc2vec assigns it a score of less than 0.44 where ACE2 is ranked 2nd based on average scores. Moreover, there is negligible difference between the prediction scores for ACE2 and the first predicted hit ANPEP in case of

Doc2vec.
These results indicate that MTT can provide high-quality prediction results and can help biologists to restrict the search space for the virus interaction partner effectively. This case study showcases the effectiveness of our method in solving virus-human PPI prediction problem and aims to convince biologists of the potential application of our prediction framework.

Conclusion
We presented a thorough overview of state-of-the-art models and their limitations for the task of virus-human PPI prediction. Our proposed approach exploits powerful statistical protein representations derived from a corpus of around 24 Million protein sequences in a multitask framework. Noting the fact that virus proteins tend to mimic human proteins towards interacting with the host proteins, we use the prediction of human PPI as a side task to regularize our model and improve generalization. The comparison of our method with a variety of state-of-the-art models on several datasets showcase the superiority of our approach. Ablation study results suggest that the human PPI prediction side task brings additional benefits and helps boost the model performance. A case study on the interaction of the SARS-CoV-2 virus spike protein and its human receptor indicates that our model can be used as an effective tool to reduce the search space for evaluating host protein candidates as interacting partners for emerging viruses. In future work, we will enhance our multitask approach by incorporating more domain information including structural protein prediction tools [79] as well as exploiting more complex multitask model architectures. yeast-two hybrid 644788

List of abbreviations
The Influenza A virus taxon ID 33761 The HPV 18 virus taxon ID 697049 The SARS-CoV-2 virus taxon ID 043570 The Zika virus taxon ID

Declarations
Ethics approval and consent to participate Not applicable

Consent for publication Not applicable
Availability of data and materials All the code and data used in this study is publicly available at https://git.l3s.uni-hannover.de/dong/multitask-transfer

Competing interests
The authors declare that they have no competing interests. The funding bodies did not play any role in the design of the study, collection, analysis, interpretation of data, and in writing the manuscript.
Authors' contributions N.D designed the study, collected the data, implemented the models and analyzed the results. G.B and G.G qualitatively validated the design and results of the case study. M.K designed and supervised the study as well as analyzed the results. All authors wrote the manuscript. All authors have read and approved the final manuscript.

A.1 Detailed Results
The following subsections provide detailed experimental results. For the Hybrid and MotifTransformer, the author's code is not available and results are taken from the original paper as the. '-' indicates that the score is not available. For other methods, the reported results are the average after 10 experimental runs. We perform pairwise t-test tests for statistical significance testing. Our presented results are statistically significant with a p-value less than 0.05.
A.2 Comparison with Methods using hand crafted protein features Table 6 provides a comparison between MTT and baselines which employ handcrafted features. MTT outperfroms Denovo in all benchmarked datasets while MTT supersede Generalized in six out of the seven datasets. The performance gains are statistically significant with a p-value of 0.05.  Table 7 provides a comparison between MTT and embedding-based methods on small testing datasets. MTT outperforms Doc2vec in 4 datasets. The performance gains are statistically significant with a p-value of 0.05. MTT is outperformed by Doc2vec in three datasets: Zhou's Ebola, Zhou's H1N1 and Denovo slim. We point out that these datasets are quite specialized where the negative training and testing samples were drawn from a sequence dissimilarity negative sampling technique. In particular, the protein sequences for negative test set were already chosen based on their dissimilarity to those is positive set. This is not a realistic setting when the positive test set is in itself unknown. Nevertheless, the performance of MTT is comparable to the state of the art on these especially curated datasets too while it outperforms all methods on more general datasets.  In Table 8 we compare MTT with its simpler variants for 11 datasets. The reported results are average after 10 runs. Results from pair-wise t-test show that that (i)MTT is significantly better than Naive baseline in all datasets with p-value smaller than 0.05, (ii) MTT is significantly better than STT in 4 out of 7 datasets with p-value smaller than 0.05 while for the remaining datasets, the difference is not statistically significant.